PHP code for removing duplicate lines from a txt file

2022-09-13

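I wrote a PHP script that fetches the image links an API redirects to and writes them to a txt file. The script already included code to remove duplicate lines, but when I opened the txt file and spot-checked it by hand, I still found quite a few duplicates.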

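So I found the code below to deduplicate the file. After running it, I reopened the txt file and found the content unchanged, with not a single line removed. Looking more closely, the link prefixes differed: the target site's API uses Sina external links, and on Sina's image host a single image can map to four external links, so the same file name shows up under different link prefixes.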

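In this situation, usually the only way to deduplicate is to extract the file name in PHP and compare on that, and I don't really feel like writing it. So forget it, I'm done.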
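For reference, deduplicating by file name could look roughly like the sketch below. It assumes the file name is the last segment of the URL path; the function name dedupeByFileName and the example.com hosts are made up for illustration and are not the real Sina domains.

```php
<?php
/**
 * Keep only the first URL seen for each image file name.
 * Assumption: the file name is the last segment of the URL path.
 */
function dedupeByFileName(array $urls): array
{
    $seen = [];
    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue; // skip blank lines
        }
        $path = parse_url($url, PHP_URL_PATH);
        $name = is_string($path) ? basename($path) : '';
        if ($name === '') {
            $name = $url; // no usable file name, compare the whole line instead
        }
        if (!isset($seen[$name])) {
            $seen[$name] = $url; // first link for this file name wins
        }
    }
    return array_values($seen);
}

// Hypothetical links: the same image under two different prefixes
$urls = [
    'https://img1.example.com/large/abc123.jpg',
    'https://img2.example.com/large/abc123.jpg',
    'https://img1.example.com/large/def456.jpg',
];
print_r(dedupeByFileName($urls));
```

With the sample list, only the first abc123.jpg link and the def456.jpg link are kept. Which of the prefixes to keep is arbitrary here; the sketch simply keeps whichever appears first.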

Below is the PHP code for removing duplicate content from a txt file. I found it online and tested it by hand; it works.

/**
 * RemoveDuplicatedLines
 * Removes all duplicated lines from the given text file and rewrites it in place.
 *
 * @param   string  $Filepath    Path of the text file to clean up
 * @param   bool    $IgnoreCase  Treat lines that differ only in case as duplicates
 * @param   string  $NewLine     Line separator
 * @return  int     Number of duplicated lines that were removed
 */
function RemoveDuplicatedLines($Filepath, $IgnoreCase=false, $NewLine="\n"){
  if (!file_exists($Filepath)){
    $ErrorMsg = 'RemoveDuplicatedLines error: ';
    $ErrorMsg .= 'The given file ' . $Filepath . ' does not exist!';
    die($ErrorMsg);
  }
  $Content = file_get_contents($Filepath);
  $Duplicates = 0;
  $Content = RemoveDuplicatedLinesByString($Content, $IgnoreCase, $NewLine, $Duplicates);
  // Is the file writable?
  if (!is_writable($Filepath)){
    $ErrorMsg = 'RemoveDuplicatedLines error: ';
    $ErrorMsg .= 'The given file ' . $Filepath . ' is not writable!';
    die($ErrorMsg);
  }
  // Write the new file
  file_put_contents($Filepath, $Content);
  // Return how many lines got removed
  return $Duplicates;
}

/**
 * RemoveDuplicatedLinesByString
 * Removes all duplicated lines from the given string (or array of lines).
 * Blank lines are dropped and the remaining lines are returned sorted.
 *
 * @param   string|array  $Lines       Text to clean up, or an array of lines
 * @param   bool          $IgnoreCase  Treat lines that differ only in case as duplicates
 * @param   string        $NewLine     Line separator
 * @param   int           $Duplicates  Set by reference to the number of duplicated lines removed
 * @return  string        The unique lines joined with $NewLine
 */
function RemoveDuplicatedLinesByString($Lines, $IgnoreCase=false, $NewLine="\n", &$Duplicates=0){
  if (is_array($Lines))
    $Lines = implode($NewLine, $Lines);
  $Lines = explode($NewLine, $Lines);
  $LineArray = array();
  $Duplicates = 0;
  // Go through all lines of the given text
  foreach ($Lines as $Line){
    // Trim whitespace for the current line (this also strips a trailing "\r")
    $CurrentLine = trim($Line);
    // Skip empty lines
    if ($CurrentLine === '')
      continue;
    // Use the line contents as array key
    $LineKey = $CurrentLine;
    if ($IgnoreCase)
      $LineKey = strtolower($LineKey);
    // Check if the array key already exists,
    // if not add it otherwise increase the counter
    if (!isset($LineArray[$LineKey]))
      $LineArray[$LineKey] = $CurrentLine;
    else
      $Duplicates++;
  }
  // Sort the array (the original line order is not preserved)
  asort($LineArray);
  // Return the unique lines as one string
  return implode($NewLine, array_values($LineArray));
}

Here is how to call it:

// Example 1 
// Removes all duplicated lines of the file defined in the first parameter.
$RemovedLinesCount = RemoveDuplicatedLines('test.txt'); 
print "Removed $RemovedLinesCount duplicate lines from the test.txt file."; 
// Example 2 (Ignore case) 
// Same as above, just ignores the line case. 
RemoveDuplicatedLines('test.txt', true); 
// Example 3 (Custom new line character) 
// By using the 3rd parameter you can define which character 
// should be used as new line indicator. In this case 
// the example file looks like 'foo;bar;foo;foo' and will
// be replaced with 'bar;foo' (the remaining entries are sorted)
RemoveDuplicatedLines('test.txt', false, ';');

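In most cases you don't need this kind of deduplication, or you can simply use the array_unique function. That is what I currently use, and it works, but it can't handle a case like this one where one image maps to several links; the only option is to add filtering code before that function runs. When I can be bothered, I'll still need to spend some time finishing this feature.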
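A minimal sketch of the array_unique approach, assuming one link per line. The function name uniqueLinesInFile and the file name links.txt are invented for this example and do not come from my actual script.

```php
<?php
/**
 * Remove exact duplicate lines from a text file, keeping the first occurrence.
 * Blank lines are dropped. Returns the number of duplicate lines removed.
 */
function uniqueLinesInFile(string $path): int
{
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Trim each line and drop the ones that end up empty
    $lines = array_filter(array_map('trim', $lines), function ($line) {
        return $line !== '';
    });
    // array_unique keeps the first occurrence and preserves the original order
    $unique = array_unique($lines);
    file_put_contents($path, implode("\n", $unique) . "\n");
    return count($lines) - count($unique);
}

// Usage (links.txt is a hypothetical file name):
// $removed = uniqueLinesInFile('links.txt');
// echo "Removed $removed duplicate lines.\n";
```

Unlike the longer function above, this keeps the lines in their original order. It still only catches exact matches, so links that differ only in their prefix survive.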

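For the past few days I've been writing a CMS, and I plan to migrate my other WordPress blog to it. All the posts have already been migrated. I dropped the tags entirely and use the CMS search to provide a tag feature that isn't stored in the database, done entirely with MySQL's LIKE operator plus caching. I've also built static page generation, which is still being tested, because I want it to work like a WordPress plugin and keep the static files in one shared folder. Post pages and list pages already work in testing, but tag pages add extra levels for pagination and can't be accessed properly from the shared folder, so this feature needs a bit more time to polish.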
